Abstract: CSA is a web server for the comprehensive comparison of pairwise protein structure
alignments. Its exact alignment engine computes either optimal, top-scoring alignments or heuristic
alignments with a quality guarantee for the inter-residue distance based scorings of contact map
overlap, PAUL, DALI and MATRAS. These and additional, uploaded alignments are compared
using a number of quality measures and intuitive visualizations. CSA brings new insight into the
structural relationship of the protein pairs under investigation and is a valuable tool for studying
structural similarities. It is available at http://csa.project.cwi.nl.

Key-words: Protein structure alignments, alignment comparison, web server, exact algorithm, 
scoring function 



CSA: Comprehensive comparison of pairwise protein structure alignments 






1 Introduction 

Protein structural alignment is a key method for answering many biological questions that in- 
volve the transfer of information from well-studied proteins to less well-known proteins. Since 
structures are more conserved during evolution than sequences, structural alignment allows for 
the most precise mapping of equivalent residues. It is notably important for (i) detecting and 
investigating structural motifs, functional sites, and common cores and (ii) measuring similarity 
between proteins and placing them in an evolutionary relationship, e.g., by classification. Numerous
web servers are available that offer individual methods for computing structural alignments.

Many structure-based scoring schemes have been proposed and there is no consensus on which
scoring is the best [12]. Comparative studies find that scorings have individual strengths and
weaknesses and that alignments produced by different methods can differ considerably [22]. In the
context of protein classification, there are first attempts to integrate information from alignments
generated by different structural alignment methods, e.g., [6, 3].

Here, we present CSA (Comparative Structural Alignment), the first web server for the
comprehensive comparison of pairwise protein structure alignments at the single-residue level. CSA
facilitates evaluating the agreement between and advantages of alignments that maximize
different established scoring schemes. It offers the computation of alignments using the scoring
schemes of DALI [15], contact map overlap (CMO) [10], MATRAS [H], and PAUL. CSA uses
our own exact algorithm [1, 25] that can be used with any inter-residue distance based scoring
scheme. We choose CMO and PAUL scoring since they are tailored to the algorithm, and DALI and
MATRAS scoring because they are established and their programs and web servers |181 HH] are
widely used. CSA returns an optimal, i.e., top-scoring alignment if one is found within the time
limit, or otherwise an alignment with a quality guarantee that specifies how much improvement is at
most possible. We denote this by calling our program and its alignments DALIX and MATRASX,
in which X indicates exact.

Optimality comes at the price of higher running time, but is especially important when
comparing alignments. A top-scoring, but biologically implausible, alignment implies that the 
scoring scheme used is inadequate to detect the given structural relationship and a different 
scoring might be more advisable. In the case of pairwise structural alignment, in which primar- 
ily residue correspondences are of interest, and only secondarily the obtained similarity score, 
comparing alignments optimized with respect to different criteria thus brings additional insight. 

In CSA, computed or uploaded alignments can be explored in terms of many inter-residue 
distance-, RMSD- and sequence-based scores and quality measures and with intuitive visualiza- 
tions such that agreements and differences between alignments are easy to grasp. The user can 
thus make educated decisions about the structural similarity of two proteins and, if necessary, 
post-process alignments by hand. Furthermore, a comparative analysis makes it possible to
differentiate between proteins with one clear-cut alignment on which various scorings agree and
proteins with ambiguous alignments for which the preferable alignment depends on the application.

2 Materials and methods 

2.1 Structural alignment algorithm 

The exact algorithm used in CSA is based on an integer linear programming (ILP) model of 
the structural alignment problem as described in [25]. Solutions to the ILP are generated using
the approach from [1]. The algorithm combines branch-and-bound and Lagrangian relaxation,
which can be seen as an iterative double dynamic programming method. The mathematical
model supports a generic scoring scheme with positive and negative structural scores, sequence
scores and affine gap costs. Many different scoring functions are special cases of this general
scheme. Currently, CSA supports DALI [15], CMO [10], MATRAS [HJ, and PAUL [25].
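
To make one of these scorings concrete, the following minimal sketch evaluates the raw CMO score, i.e., the number of common contacts, of a given alignment. It only scores an alignment and is not the exact ILP algorithm. The function names, the toy coordinates and the representation of an alignment as index pairs are assumptions made for illustration; the 7.5 Å threshold is the CMO default mentioned in Section 3.1, and no sequence-separation filter is applied here.

```python
from itertools import combinations
from math import dist


def contact_map(coords, threshold=7.5):
    """Return all residue index pairs (i, j), i < j, closer than threshold (in Angstrom)."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if dist(coords[i], coords[j]) < threshold
    }


def common_contacts(coords_a, coords_b, alignment, threshold=7.5):
    """Count contacts of A that the alignment maps onto contacts of B.

    alignment is a list of (index in A, index in B) pairs that is
    strictly increasing in both components.
    """
    contacts_a = contact_map(coords_a, threshold)
    contacts_b = contact_map(coords_b, threshold)
    return sum(
        1
        for (i, k), (j, l) in combinations(alignment, 2)
        if (i, j) in contacts_a and (k, l) in contacts_b
    )


# Hypothetical toy structures: points spaced 4 Angstrom apart on a line.
toy_a = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0)]
toy_b = [(0, 0, 0), (0, 4, 0), (0, 8, 0)]
toy_alignment = [(0, 0), (1, 1), (2, 2)]
toy_score = common_contacts(toy_a, toy_b, toy_alignment)
```

In this toy case only neighboring points are in contact, so the alignment of three residues preserves two contacts.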

2.2 Webserver implementation 

The architecture of the web server is divided into a processing layer that computes (C++) and
evaluates (Python) alignments and an output layer, which generates W3C-validated HTML
pages, interacts with the user and displays all information (PHP and JavaScript). The interface
between the two layers is a MySQL database.

The alignment engine is identical for all four currently supported scoring schemes and is
implemented in C++ as a stand-alone program. User-adjustable parameters are the time limit
of the computation, the maximum number of branch-and-bound nodes, and the number of
Lagrangian iterations in each node. Furthermore, each scoring scheme has its own parameters, for
example, the use of Cα or Cβ inter-residue distances.

Computed or user-uploaded (in FASTA format) alignments are read into a Python class and 
subsequently written to the MySQL database. A second Python class handles the computation
of different scores. It obtains the required structural information from the PDB files with the
help of the Biopython package Bio.PDB [11]. Tasks related to superpositioning are also handled
by this package. Visualizations of distance and distance difference matrices are generated using 
the Python Imaging Library. 
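
As an illustration of the first step, the sketch below turns a pairwise alignment given as two gapped FASTA records into a list of aligned residue index pairs. It is a simplified stand-in and not the Python class used in CSA: the function names and the example sequences are hypothetical, and it assumes that gaps are written as "-" and that both rows have equal length.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_pairs(row_a, row_b, gap="-"):
    """Convert two gapped rows into 0-based (index in A, index in B) pairs."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs, i, j = [], 0, 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.append((i, j))
        # Advance each sequence counter only when its row holds a residue.
        i += a != gap
        j += b != gap
    return pairs


# Hypothetical two-record alignment file.
example = ">protein_A\nMK-LV\n>protein_B\nM-ALV\n"
(name_a, row_a), (name_b, row_b) = read_fasta(example)
example_pairs = aligned_pairs(row_a, row_b)
```

The resulting index pairs are sequential positions; mapping them to PDB residue numbers would require the residue list of the selected chain.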

The website functions have been implemented in separate modules, which makes it easy to 
integrate additional structural alignment methods. The modularity is illustrated by the use of a 
tab menu. All web server functions are extensively documented, which is indicated by a question
mark next to the respective section titles or table headers. Additionally, the documentation puts
instructions and explanations into context. Notably, we documented all structural alignment
scorings that are used within CSA and we provide the corresponding formulas and references. In
the output layer, structures and their superpositions are visualized in Jmol (http://www.jmol.org/)
and images are generated using the PHP package pChart (http://www.pchart.net/).

3 Case studies 

We illustrate the functionality of CSA using two case studies which are accessible from its main 
page via the links "Example 1" and "Example 2". 

3.1 Benefits of visualization and comparison 

The first case study deals with two proteins from the SISY data set |22l 0] , ubiquitin-binding 
protein CUE2 (PDB ID 1otr, chain A, 49 residues) and the CUE domain of activating signal
cointegrator 1 complex subunit 2 (PDB ID 2di0, chain A, 71 residues). The proteins belong to
the SISYPHUS [2] alignment AL00088995 of homologous proteins containing a CUE domain.
The CUE domain is composed of a three-helix bundle and consists of 41 residues. It binds
ubiquitin and is involved in protein degradation. 

After specifying PDB IDs and chains on the main page of CSA, the user is redirected to the 
CSA evaluation environment. It is organized in tabs for the following tasks: an overview of the
protein structures, computing alignments using CMO, PAUL, DALI or MATRAS scoring, upload of
external alignments, and the comparison of alignments.

The Structures tab lists PDB IDs, PDB file names, selected chains and their lengths and 
amino acid sequences. Links to the PDB [5] and to iHOP [13] are access points for additional
information concerning the proteins and their function. Protein structures are visualized in Jmol,
and their Cα and Cβ distance matrices and contact maps are displayed as well.

We compute CMO, PAUL, DALIX and MATRASX alignments using the default options, i.e., with
a time limit of 30 CPU seconds. The setup of all four result pages is identical. As an example, we
consider the CMO alignment page; parts of it are displayed in Figure 1.

Bounds on alignment score and similarity. The section Optimized score lists the resulting
scores: the raw score s(A, B) of proteins A and B (here, the number of common contacts), and
a similarity score that normalizes the raw score with respect to the self-similarity of the two
proteins, computed as $\frac{2\,s(A,B)}{s(A,A)+s(B,B)}$. Our exact algorithm returns lower and
upper bounds (LB and UB) on the raw and similarity scores. Based on these bounds, the relative
gap in percent, $100 \cdot \frac{UB - LB}{LB}$, quantifies by how many percent an alignment can at
most be improved. Such
a quality guarantee helps to quickly determine the progress of the computation as well as the
similarity of the two proteins. If two proteins are dissimilar, the relative gap tends to be large,
but the upper bound on the similarity score tends to be low from the beginning. Aligning
1otrA with 2di0A w.r.t. CMO yields 125 common contacts, and the corresponding similarity score
on a scale from 0 to 1 is 0.751. The relative gap is 0%, indicating that the top-scoring alignment
has been found.
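
The two quantities of this section can be sketched in a few lines. The sketch assumes that the similarity score is the raw score normalized as 2 s(A,B) / (s(A,A) + s(B,B)) and that the relative gap is 100 (UB - LB) / LB. The function names are illustrative, and the self-similarity values 150 and 183 are hypothetical numbers chosen only so that the result matches the reported similarity of 0.751; they are not taken from CSA.

```python
def similarity_score(s_ab, s_aa, s_bb):
    """Normalize a raw score by the self-similarity of both proteins."""
    return 2.0 * s_ab / (s_aa + s_bb)


def relative_gap(lower_bound, upper_bound):
    """Relative gap in percent between the best known score and its upper bound."""
    if lower_bound <= 0:
        raise ValueError("the relative gap needs a positive lower bound")
    return 100.0 * (upper_bound - lower_bound) / lower_bound


# 125 common contacts as reported above; self-scores are hypothetical.
cmo_similarity = similarity_score(125, 150, 183)
cmo_gap = relative_gap(125, 125)      # bounds coincide: alignment is optimal
open_gap = relative_gap(100, 112)     # alignment may still improve by 12%
```

A gap of zero certifies optimality, whereas a positive gap only bounds the possible improvement and does not imply that a better alignment exists.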

Structural conservation and variation. The Alignment section displays the computed 
structural alignment. Residues are color-coded according to either SSE (helix, sheet, coil) as 
assigned by DSSP [17] or to residue pair score contribution. The second color-coding denotes
how well single residue pairs are structurally conserved given the current alignment, cf. Fig. 1.
For the two proteins containing the CUE domain, this indicates that the first identically aligned 
leucines are structurally conserved, and in fact this position is part of a motif for binding ubiqui- 
tin that consists of an invariant proline and two highly conserved leucines [22] . Pairs of aligned 
residues with low score contribution highlight structural variations. In the CMO alignment, the 
N- and C-terminal regions show little structural conservation, as do the residues in the region of
the invariant proline within the CUE domain, because the proline is located in a turn. Such a
visualization of residue score contribution can hint towards a manual modification of the alignment
by removing aligned residues with low score. In fact, this is what happens in the top-scoring
DALIX alignment of 1otrA and 2di0A, in which the four C-terminal residues with low CMO score
are excluded from the alignment.

Comprehensive alignment-related data. In addition to the alignment, CSA displays the
aligned segments, using both sequential and PDB residue numbering, cf. Fig. 1. Numerous raw
alignment and similarity scores are listed, for example the number of aligned residues, sequence
identity and root mean square deviation (RMSD). Furthermore, some statistics concerning the 
alignment computation are given. These are the number of residues and inter-residue distances 
considered during computation. They greatly influence the memory consumption of the
algorithm: the more inter-residue distances are considered, the more memory is needed and typically
the longer the running time. Using default values, CMO only considers distances smaller than 7.5
Å, PAUL considers distances smaller than 8.5 Å (for Cα distances, 9.5 Å), MATRAS uses distances
up to 50 Å and DALI all distances. Because server memory is shared among users, we currently
restrict computations using the DALI or MATRAS scoring to protein pairs with average length 
less than 150 residues. The allocation time for setting up all data structures is given, as well as 
the time actually spent on computing the alignment. The number of visited branch-and-bound 
nodes gives a good estimate of the progress of the computation. The proteins are superposed
according to the alignment and visualized in Jmol. The trace of aligned residues and the distance
difference matrix are plotted.

RR n° 7874

Wohlers, Malod-Dognin, Andonov and Klau

We upload an additional alignment in the tab for the first custom alignment. This alignment 
aligns only the 38 residues that belong to the respective CUE domain and that are structurally 
equivalent according to SISY. Furthermore, we upload a second custom alignment which has
been generated by the DALI server [T3]. The DALI server uses a heuristic algorithm to find a good
alignment according to the DALI score.

Improving, verifying optimality and assessing quality of heuristic alignments. Many
different scorings and quality measures can be compared in the Comparison tab: the CMO,
PAUL, DALI, and MATRAS raw and similarity scores, DALI z-score [IB], TM-score [57], number
and percentage of aligned residues, coordinate and distance RMSD, RMSD100 [7], and sequence
identity. For 1otrA and 2di0A, all six computed and uploaded alignments differ from each
other. While the CMO and PAUL alignments were computed to optimality in less than a second, the
DALIX alignment has the potential to be improved by up to 12% and the MATRASX alignment
by up to 24%. We also observe that the alignment that was computed by the DALI server and
then uploaded is better with respect to DALI score than the alignment computed by our exact
algorithm within 30 s. We thus increase the maximum running time for DALIX and MATRASX to 10
minutes. Now, both alignments are computed to provable optimality and our top-scoring DALIX
alignment slightly improves on the heuristic solution returned by the DALI server. DALIX and
MATRASX alignments thus can be used to obtain quality guarantees for DALI or MATRAS alignments
and in some cases also to either prove their optimality or to compute a better alignment.

Multi-criteria comparison and selecting a sound alignment. Alignment trace compari- 
son as introduced in |!J] gives a visual overview about agreements and differences between align- 
ments. Here, any subset of alignments can be shown. Using this visualization, we find that all 
alignments (except the SISY reference, which excludes 3 residues in the center of the domain) 
correctly align all 41 residues of the CUE2 domain, and that they differ in aligning the
neighboring N- and C-terminal residues. A radar chart compares the different scores, cf. Fig. 2. This
chart helps to quantify score differences and to decide whether one alignment is clearly
preferable, i.e., better with respect to all criteria. The chart also supports an intuitive
decision on which alignment is most appropriate in cases in which different scorings disagree, as
is the case for 1otrA and 2di0A. Here, intuitively the DALIX alignment is the best choice since it
performs well or best according to all criteria.
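
The notion of an alignment being "clearly preferable" can be phrased as a dominance check over the compared scores. The following sketch is one possible formalization and not part of CSA: the function names and all score values are hypothetical, and every criterion is assumed to be normalized so that larger values are better (as on the radar chart, where values closer to 1 are better).

```python
def dominates(scores_x, scores_y):
    """True if x is at least as good as y on every shared criterion and better on one."""
    keys = scores_x.keys() & scores_y.keys()
    return (all(scores_x[k] >= scores_y[k] for k in keys)
            and any(scores_x[k] > scores_y[k] for k in keys))


def undominated(alignments):
    """Names of alignments that no other alignment dominates."""
    return sorted(
        name
        for name, scores in alignments.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in alignments.items()
            if other != name
        )
    )


# Hypothetical normalized scores for three alignments of the same protein pair.
example_scores = {
    "X": {"cmo": 0.90, "tm": 0.80},
    "Y": {"cmo": 0.70, "tm": 0.60},
    "Z": {"cmo": 0.95, "tm": 0.50},
}
candidates = undominated(example_scores)
```

Here X dominates Y, while X and Z each win on one criterion, so both remain candidates and the choice between them depends on the application.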

Two residue pair lists show aligned residues that appear in all, resp. in the majority, of the 
alignments. They each constitute a consensus alignment. In the case of aligning 1otrA and
2di0A, we see that such a consensus is useful: the alignments agree only in aligning the CUE2
domain. The consensus thus highlights the structurally conserved and biologically relevant region 
of the alignment. 
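
The two consensus lists can be sketched as simple counting over residue pairs. The function names and the three toy alignments are hypothetical; alignments are assumed to be given as lists of (index in A, index in B) pairs, and "majority" is interpreted here as more than half of the alignments.

```python
from collections import Counter


def pair_counts(alignments):
    """Count in how many alignments each residue pair occurs."""
    return Counter(pair for pairs in alignments for pair in set(pairs))


def strict_consensus(alignments):
    """Residue pairs that occur in all alignments."""
    counts = pair_counts(alignments)
    return sorted(p for p, n in counts.items() if n == len(alignments))


def majority_consensus(alignments):
    """Residue pairs that occur in more than half of the alignments."""
    counts = pair_counts(alignments)
    return sorted(p for p, n in counts.items() if 2 * n > len(alignments))


# Three hypothetical alignments of the same protein pair.
toy_alignments = [
    [(0, 0), (1, 1), (2, 2)],
    [(1, 1), (2, 2), (3, 3)],
    [(1, 1), (2, 3)],
]
in_all = strict_consensus(toy_alignments)
in_most = majority_consensus(toy_alignments)
```

With a strict majority threshold, any two selected pairs occur together in at least one input alignment, so the majority list is itself a valid alignment.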

3.2 Alignment of flexible proteins 

We illustrate the usefulness of comparing structural alignments in the case of protein flexibility. 
This is a challenge for most structural alignment methods because flexible proteins typically do 
not superpose well unless the flexibility is accommodated for, e.g., by explicitly introducing a 
hinge. 

Comparing flexible and rigid scoring schemes. We align two conformations of the
calmodulin protein (PDB IDs 4cln, chain A and 2bbm, chain A, with a length of 148 residues). In
structure 2bbmA, calmodulin is bound to a ligand, in 4clnA it is unbound. In the bound
conformation, a central helix is split and the components at the ends of the helix are moved towards
each other. We align the two conformations using CMO and PAUL. We furthermore upload the
alignments computed by TM-ALIGN [25], an algorithm maximizing the TM-score, and DAST [20],
a local structural alignment method that determines the longest alignment with distance RMSD
less than 4 Å. We find that both CMO and PAUL correctly align the two conformations over
their entire length. To keep the comparison concise, we thus exclude the CMO alignment from
the following visualizations. Figure 3 displays the two conformers superposed according to the
TM-alignment and the alignment trace comparison. While PAUL aligns all residues of the two
conformers correctly, TM-ALIGN aligns only the C-terminal, rigidly superposable region (except
the C-terminal residue). DAST also aligns the C-terminal region, but excludes and shifts
further residues from the alignment. The radar chart comparing the different scores as well as the
distance difference matrices displayed in Figure 3 show why: while CMO, PAUL, DALI and
MATRAS scoring by far favor the alignment of the entire conformers, TM-score as well as RMSD100
clearly favor the TM- and DAST alignments, which have a much smaller RMSD, but align only
the C-terminal region.

Detecting flexibility and hinges. For each alignment we display the distance difference
matrix. This is a symmetric square matrix with entries $|d^A_{ij} - d^B_{ij}|$ at position (i, j), where i is
the i-th aligned position, j the j-th aligned position, and $d^A_{ij}$ and $d^B_{ij}$ are the
distances between the residues at these positions in proteins A and B. Here, distance differences
are visualized using a color gradient in which 0 Å is colored red, 2.5 Å green and 5 Å blue. Regions with low
inter-residue distance differences correspond to rigidly superposable fragments. For the PAUL
alignment of 4clnA and 2bbmA, red blocks in the distance difference matrix indicate that both
the N-terminal and C-terminal regions can be superpositioned very precisely. The distance
differences between these two regions, however, are large, denoted by the blocks in blue color.
The two regions can thus only be well superpositioned individually. A hinge is present at the
residue bordering the two blocks (position 80) [8]. TM-ALIGN and DAST align only the C-terminal
region, thus avoiding any large distance differences. DAST is more restrictive in excluding large
distance differences: it does not align a few residues that are still aligned by the TM-alignment
and which have distance differences larger than 5 Å, colored in blue.
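
A distance difference matrix as described above can be computed directly from the coordinates of the aligned residues, without any superposition. The sketch below is illustrative only: the function name and the toy coordinates are hypothetical, and coordinates are assumed to be one representative atom (for example Cα) per residue.

```python
from math import dist


def distance_difference_matrix(coords_a, coords_b, alignment):
    """Matrix with entries |d^A_ij - d^B_ij| over the aligned positions.

    alignment is a list of (index in A, index in B) pairs; position i of the
    matrix refers to the i-th aligned pair.
    """
    a = [coords_a[i] for i, _ in alignment]
    b = [coords_b[k] for _, k in alignment]
    n = len(alignment)
    return [
        [abs(dist(a[i], a[j]) - dist(b[i], b[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy conformers: B equals A bent by 90 degrees at residue 1.
straight = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0)]
bent = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (3, 6, 0)]
identity = [(0, 0), (1, 1), (2, 2), (3, 3)]
ddm = distance_difference_matrix(straight, bent, identity)
```

In this toy case the entries within each rigid part are zero, while entries that pair residues across the bend are positive, which mirrors the block pattern described for the hinge in calmodulin.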

Scores such as CMO and PAUL, which implicitly ignore RMSD, are useful to gain information about
flexible regions. While this feature is beneficial for flexible proteins, it may also introduce
flexibility where this is not appropriate. Protein similarities consisting of compact, well superposable
fragments are therefore often better detected by maximizing scores like the TM- or the DAST
score.

4 Conclusion 

Different structural alignment scoring functions have different strengths and weaknesses. Which 
scoring to use depends on the application and on the structural relationship of the investigated 
proteins. Their different focus on handling various aspects of structural similarity is one reason 
why there are many different structural alignment scorings and programs and no consensus which 
combination is best. 

We therefore consider it beneficial to compute alignments using different scoring schemes and 
algorithms and to compare them in order to gain insight into their structural relationship. The 
CSA web server provides the tools for such a comparison. CSA allows the user to compute alignments
with various scorings, returns a quality guarantee for the alignments and enables the user to
additionally evaluate and compare uploaded alignments. In the most common case in which
scorings and alignments disagree, it facilitates evaluating the agreement and differences between
them and selecting the most suitable alignment.

Figure 1: Parts of the information displayed on the website for the CMO alignment of 1otrA and
2di0A.

Figure 2: A radar chart for comparison of alignment scores for six different alignments of 1otrA
and 2di0A. The closer a point is to 1, the better the corresponding score. CMO, PAUL, DALIX
and MATRASX alignments have been computed by our exact algorithm and are provably optimal
concerning their respective score. The SISY reference alignment aligns 38 residues of the CUE2
domain. The DALI alignment was computed by the DALI server and has a slightly lower DALI score
than the optimal DALIX alignment. The reference alignment is far behind in all scores except
RMSD100 and TM-score, for which it performs quite well. The MATRASX alignment performs
especially poorly for these two measures. Intuitively, the DALIX alignment is most preferable since
it has optimal DALI and close to optimal CMO, PAUL and MATRAS scores, as well as the best
TM-score and RMSD100.

Figure 3: Top left: The two calmodulin conformers (PDB IDs 4cln and 2bbm) superpositioned
according to the TM-alignment, which aligns only one of the two regions that move relative
to each other. Top right: Comparison of the alignment traces. Each axis corresponds to one
conformer. Black boxes denote residue pairs aligned by all three scorings, PAUL, TM-ALIGN and
DAST. Light gray denotes residue pairs aligned by only one scoring. An intermediate shade of gray
denotes agreement of two scorings. PAUL aligns all residues of the two conformers, TM-ALIGN and
DAST the C-terminal region. Center: The radar chart illustrates the difference between scorings
that are more in favor of a flexible alignment, i.e., CMO, PAUL, DALI and MATRAS, and scorings
that are more in favor of a rigid superposition of low RMSD. Bottom: The distance difference
matrices illustrate the difference between the flexible PAUL alignment, which aligns all residues
in spite of large distance differences (colored blue), and the TM- and DAST alignments, which
exclude large distance differences, but only align the C-terminal region.